Financial Contribution to Presidential Campaign of California, 2016

WEI MA

August 30, 2016

========================================================

Introduction

California lies on the west coast of the United States. It is the most populous state and the third largest by area. In addition to the prosperity of the film industry in Hollywood, the large number of high-tech companies that have sprung up across Silicon Valley makes California one of the largest economies in the world. There are 58 counties and 482 municipalities in California. The state’s political atmosphere is generally considered liberal rather than conservative. By examining financial contributions to the presidential campaign from California, we can better understand its political geography. Do people contribute more to Democratic candidates than to Republican candidates? Which candidates have the most financial support from California? Which properties of a candidate make California residents tend to support him or her? These questions are investigated in this project.

Univariate Plots Section

First, have a look at the data frame.

## Classes 'tbl_df', 'tbl' and 'data.frame':    542729 obs. of  24 variables:
##  $ contbr_city                 : chr  "SAN FRANCISCO" "SANTA BARBARA" "LOS ANGELES" "IRVINE" ...
##  $ cand_id                     : Factor w/ 24 levels "P00003392","P20002671",..: 12 12 12 12 12 12 12 12 12 12 ...
##  $ contbr_zip                  : Factor w/ 85150 levels "","00000","000090272",..: 53916 41095 1547 33840 11562 23445 66550 71648 71648 71648 ...
##  $ contbr_employer             : Factor w/ 34054 levels ""," APPLE INC.",..: 4793 18836 26432 4109 11816 9835 20672 24074 24074 24074 ...
##  $ county                      : chr  "SAN FRANCISCO" "SANTA BARBARA" "LOS ANGELES" "ORANGE" ...
##  $ county_fips                 : int  6075 6083 6037 6059 6037 6073 6013 6087 6087 6087 ...
##  $ cand_nm                     : Factor w/ 24 levels "Bush, Jeb","Carson, Benjamin S.",..: 19 19 19 19 19 19 19 19 19 19 ...
##  $ zip_ref                     : int  94118 93111 90025 92618 90404 91910 94804 95062 95062 95062 ...
##  $ contbr_occupation           : Factor w/ 15688 levels ""," REAL ESTATE BROKER",..: 11517 1989 15007 4682 11276 4682 9114 11854 11854 11854 ...
##  $ zip_city                    : Factor w/ 1064 levels "","ACAMPO","ACTON",..: 825 850 526 429 857 187 781 853 853 853 ...
##  $ contbr_nm                   : Factor w/ 100016 levels "_BOOTH, ELAINE S.",..: 4881 12668 12678 12681 12694 899 5959 13198 13198 13198 ...
##  $ contb_receipt_dt            : Factor w/ 487 levels "01-APR-15","01-APR-16",..: 452 452 452 452 452 452 452 452 452 452 ...
##  $ contb_receipt_amt.individual: num  10 10 35 200 33.5 ...
##  $ cand_pty_aff                : Factor w/ 4 levels "Democratic Party",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ cand_gender                 : chr  "male" "male" "male" "male" ...
##  $ city.pop                    : int  723724 86115 3911500 199755 89112 221736 103818 54583 54583 54583 ...
##  $ city.lat                    : num  37.8 34.4 34.1 33.7 34 ...
##  $ city.long                   : num  -122 -120 -118 -118 -118 ...
##  $ total.amt                   : num  1.7e+07 1.7e+07 1.7e+07 1.7e+07 1.7e+07 ...
##  $ cand.n                      : int  307483 307483 307483 307483 307483 307483 307483 307483 307483 307483 ...
##  $ contbr_region               : chr  "SILICON VALLEY" "WEST CA" "WEST CA" "SOUTH CA" ...
##  $ region.pop                  : num  6787417 11504823 11504823 10794239 11504823 ...
##  $ contbr_region.n             : int  164862 161614 161614 104188 161614 104188 164862 164862 164862 164862 ...
##  $ contbr_county.n             : int  36662 8964 132553 35173 132553 42239 23521 10890 10890 10890 ...
## [1] 542729     24

There are 542729 observations and 24 variables.

Both the histogram on the left and the Q-Q plot on the right show that the single transaction amount is not normally distributed. To confirm this, I conduct a Shapiro-Wilk test on a sample of 5000 transactions drawn from the data.

## 
##  Shapiro-Wilk normality test
## 
## data:  sample0
## W = 0.58859, p-value < 2.2e-16

In the Shapiro-Wilk test, the null hypothesis is that the sample comes from a normally distributed population, and the alternative hypothesis is that the population is not normally distributed. The p-value is less than 2.2e-16, well below 0.05, so we reject the null: the population is not normally distributed. Next, I apply a log10 transformation to the single transaction amount axis.

After the transformation, the distribution looks roughly normal. Now conduct the Shapiro-Wilk test on the log10 scale of the single transaction amount.

## 
##  Shapiro-Wilk normality test
## 
## data:  sample1
## W = 0.94799, p-value < 2.2e-16

As the p-value is still less than 2.2e-16, the null hypothesis of normality is rejected again, which means the log10-scaled amount is still not normally distributed. However, W rises from 0.59 to 0.95, so the transformed amount is much closer to normal than the raw amount.
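The two tests above can be sketched as follows. This is a minimal illustration on simulated data, not the code used for this report: the vector `amt` is a made-up, right-skewed stand-in for the transaction amounts, and only the names `sample0` and `sample1` are taken from the test output above.

```r
# Simulated stand-in for the single transaction amounts (an assumption,
# not the real data): strictly positive and heavily right-skewed.
set.seed(42)
amt <- rlnorm(20000, meanlog = log(30), sdlog = 1.5)

# shapiro.test() accepts at most 5000 values, so test a random sample.
sample0 <- sample(amt, 5000)
raw_test <- shapiro.test(sample0)

# Repeat the test on the log10 scale of the same sample.
sample1 <- log10(sample0)
log_test <- shapiro.test(sample1)

# W closer to 1 means the sample is closer to a normal distribution.
c(raw = unname(raw_test$statistic), log10 = unname(log_test$statistic))
```

Because the simulated amounts are exactly log-normal, the log10 sample here passes the test, unlike the real data. With samples this large the test flags even small departures from normality, so the W statistic and the Q-Q plot are often more informative than the p-value alone.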

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.01    15.00    30.00   143.50   100.00 10800.00

The single transaction amount is highly skewed, so it is better to transform it to the log10 scale, which brings it closer to a normal distribution. The middle half of the single transaction amounts falls between 15 and 100 dollars (the first and third quartiles).

Now I look at the distribution of the single transaction amount for each party by faceting on the candidates’ party.

It looks like supporters of the Republican Party tend to contribute more in an individual transaction, whereas the Democratic Party’s supporters contribute less. To get a better look at how the distribution of the single transaction amount varies between parties, a frequency polygon is plotted below. The y axis is density, that is, the proportion of a party’s supporters who gave that specific single amount.

## fc_v7$cand_pty_aff: Democratic Party
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.01    15.00    27.00   119.40    54.00 10000.00 
## -------------------------------------------------------- 
## fc_v7$cand_pty_aff: Green Party
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0    35.0   100.0   141.8   250.0  1000.0 
## -------------------------------------------------------- 
## fc_v7$cand_pty_aff: Libertarian Party
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    25.0   100.0   250.0   428.8   500.0  2700.0 
## -------------------------------------------------------- 
## fc_v7$cand_pty_aff: Republican Party
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.01    25.00    50.00   238.00   100.00 10800.00

Because the single transaction amount is highly skewed, the median is a better indicator of the typical amount than the mean. For the Democratic Party, the middle half of single transaction amounts falls between 15 and 54 dollars, and the median is 27 dollars. For the Republican Party, the middle half falls between 25 and 100 dollars and the median is 50 dollars, so supporters of the Republican Party tend to contribute more in a single transaction than supporters of the Democratic Party. Supporters of the Green Party mostly contribute 35 to 250 dollars in a single transaction, and supporters of the Libertarian Party mostly contribute 100 to 500 dollars. In conclusion, judged by the median, supporters of the Libertarian Party contributed the largest amount in a single transaction, followed by the Green Party and then the Republican Party. Supporters of the Democratic Party gave the smallest amount in a single transaction. Also, the largest single transaction, 10800 dollars, went to the Republican Party.
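A per-party summary like the one above can be produced with `by()`. The sketch below uses a tiny made-up data frame named `toy` (an assumption); only the column names `cand_pty_aff` and `contb_receipt_amt.individual` come from the data set.

```r
# Hypothetical miniature of the data; the amounts are invented for illustration.
toy <- data.frame(
  cand_pty_aff = c("Democratic Party", "Democratic Party", "Democratic Party",
                   "Republican Party", "Republican Party", "Republican Party"),
  contb_receipt_amt.individual = c(10, 27, 2700, 25, 50, 5400)
)

# Summary (min, quartiles, mean, max) of the amount within each party.
by(toy$contb_receipt_amt.individual, toy$cand_pty_aff, summary)

# Median and mean per party: a single large donation pulls the mean far
# above the median, which is why the median is the better indicator here.
party_median <- tapply(toy$contb_receipt_amt.individual, toy$cand_pty_aff, median)
party_mean   <- tapply(toy$contb_receipt_amt.individual, toy$cand_pty_aff, mean)
```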

Now I wonder which party received the largest total number of transactions. Because the transaction counts span several orders of magnitude, a log scale is used to make the smaller values visible. I put the bar graph of raw transaction counts on the left and the log-scaled counts on the right.

The Democratic Party has the largest number of transactions, and the Republican Party has the second largest. It is obvious that the Green Party and the Libertarian Party have far fewer transactions than the Republican and Democratic parties.

From the analysis of the single transaction amount distribution and the bar graph above, we already know a basic fact about the two main parties, the Democratic Party and the Republican Party. The single transaction amount for the Republican Party is relatively larger than for the Democratic Party, whereas the total number of transactions for the Democratic Party is far larger than for the Republican Party.

At this point, I would like to look at the total number of transactions for individual candidates in the two main parties. Which candidate has the most transactions in the Democratic Party, and which in the Republican Party? First, a bar graph for the Democratic Party is drawn.

Second, bar graphs of the number of transactions for the Republican Party are drawn on both the raw scale and the log scale.

In the Democratic Party, Bernie Sanders has the most transactions, followed by Hillary Clinton. In the Republican Party, Ted Cruz has the most transactions, followed by Ben Carson and then Marco Rubio.

## 
##                 Bush, Jeb       Carson, Benjamin S. 
##                      2981                     26921 
##  Christie, Christopher J.   Clinton, Hillary Rodham 
##                       323                    123824 
## Cruz, Rafael Edward 'Ted'            Fiorina, Carly 
##                     52246                      4637 
##     Gilmore, James S IIII        Graham, Lindsey O. 
##                         3                       304 
##            Huckabee, Mike             Jindal, Bobby 
##                       520                        31 
##             Johnson, Gary           Kasich, John R. 
##                        28                      2852 
##          Lessig, Lawrence   O'Malley, Martin Joseph 
##                       371                       395 
##         Pataki, George E.                Paul, Rand 
##                        20                      4220 
##    Perry, James R. (Rick)              Rubio, Marco 
##                       109                     13137 
##          Sanders, Bernard      Santorum, Richard J. 
##                    307483                        82 
##               Stein, Jill          Trump, Donald J. 
##                       153                      1360 
##             Walker, Scott     Webb, James Henry Jr. 
##                       623                       106

Now have a closer look at the Democratic Party and the Republican Party, focusing on 7 main presidential candidates: “Clinton, Hillary Rodham”, “Sanders, Bernard”, “Bush, Jeb”, “Carson, Benjamin S.”, “Trump, Donald J.”, “Rubio, Marco” and “Cruz, Rafael Edward ‘Ted’”.

Although the Democratic Party has fewer candidates than the Republican Party, it has more transactions.

Next, I look at the number of transactions from different regions.

## Source: local data frame [7 x 2]
## 
##    contbr_region contbr_region.n
##            (chr)           (int)
## 1 SILICON VALLEY          164862
## 2        WEST CA          161614
## 3       SOUTH CA          104188
## 4       NORTH CA           69290
## 5      JEFFERSON           17712
## 6     CENTRAL CA           24538
## 7             NA             525

In the data set, most transactions come from Silicon Valley. The second largest number comes from West California and the third largest from South California, followed by North California, Central California and Jefferson.

Univariate Analysis

What is the structure of your dataset?

There are 542729 contribution transactions. The largest single transaction is 10800 dollars and the smallest is 0.01 dollars. The median single transaction amount is 30.00 dollars and the mean is 143.50 dollars. Because the single transaction amount is so skewed, the median is the more informative statistic for the typical amount of a single transaction.

In this data set, there are 24 candidates. Only three of them are female: “Clinton, Hillary Rodham”, “Stein, Jill” and “Fiorina, Carly”. All presidential candidates come from 4 parties: Democratic Party, Green Party, Republican Party and Libertarian Party.

5 candidates from Democratic Party: “Sanders, Bernard”, “Clinton, Hillary Rodham”, “Webb, James Henry Jr.”, “O’Malley, Martin Joseph” and “Lessig, Lawrence”.

17 candidates from Republican Party: “Cruz, Rafael Edward ‘Ted’”, “Walker, Scott”, “Bush, Jeb”, “Rubio, Marco”, “Christie, Christopher J.”, “Paul, Rand”, “Kasich, John R.”, “Fiorina, Carly”, “Santorum, Richard J.”, “Jindal, Bobby”, “Huckabee, Mike”, “Trump, Donald J.”, “Pataki, George E.”, “Carson, Benjamin S.”, “Graham, Lindsey O.”, “Perry, James R. (Rick)” and “Gilmore, James S IIII”.

1 candidate from Green Party: Stein, Jill.

1 candidate from Libertarian Party: Johnson, Gary.

These contributions were made by 104411 contributors, spread across 58 counties and 1458 cities. They come from 15687 occupations.

Univariate Analysis Observations:

  • Most single contribution amounts fall between 15 dollars and 100 dollars, with a median of 30 dollars.
  • Bernie Sanders has the highest number of contributions.
  • Hillary Clinton has the second highest number of contributions.
  • Ted Cruz has the third highest number of contributions.
  • The Republican Party has fewer contributions, but each transaction amount is higher than for the Democratic Party.
  • The Democratic Party has more contributions, but each transaction amount is relatively lower than for the Republican Party.

What is/are the main feature(s) of interest in your dataset?

To predict the behavior of a specific contributor, the main features in the data set are the contributor’s city or county and the contributor’s occupation. Contributors with certain properties might favor presidential candidates with specific properties, such as the candidate’s gender and party.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The contributor’s employer might also affect the amount and frequency of donations. A contributor may have donated several times, which makes that contributor’s total donation relatively high even though each single transaction is small.

Did you create any new variables from existing variables in the dataset?

First, I added 2 variables derived from the existing dataset:

  • One is the total contribution a candidate obtained, which is total.amt.
  • The other is the total number of transactions a candidate obtained, which is cand.n.

Second, I divide California into 6 regions according to the Six Californias plan and add a region variable to each contributor based on their county and city. The 6 regions are Jefferson, North California, Central California, Silicon Valley, West California and South California.
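The derived variables could be built along these lines. This is a sketch under assumptions, not the original code: the data frame `toy` and its amounts are invented, and `region_lookup` is a hypothetical, deliberately partial county-to-region table containing only the four county and region pairs visible in the data frame printout above.

```r
# Hypothetical miniature of the data; column names follow the data set.
toy <- data.frame(
  cand_nm = c("Sanders, Bernard", "Sanders, Bernard",
              "Clinton, Hillary Rodham", "Sanders, Bernard"),
  county  = c("SAN FRANCISCO", "LOS ANGELES", "ORANGE", "ALPINE"),
  contb_receipt_amt.individual = c(10, 35, 200, 5),
  stringsAsFactors = FALSE
)

# total.amt: total amount received by the candidate of each transaction.
toy$total.amt <- ave(toy$contb_receipt_amt.individual, toy$cand_nm, FUN = sum)

# cand.n: number of transactions received by that candidate.
toy$cand.n <- ave(toy$contb_receipt_amt.individual, toy$cand_nm, FUN = length)

# contbr_region: look the county up in a named vector (partial, hypothetical).
region_lookup <- c("SAN FRANCISCO" = "SILICON VALLEY",
                   "SANTA BARBARA" = "WEST CA",
                   "LOS ANGELES"   = "WEST CA",
                   "ORANGE"        = "SOUTH CA")
toy$contbr_region <- unname(region_lookup[toy$county])
```

A county missing from the lookup table gets `NA` as its region, so unmatched rows stay visible instead of being silently dropped.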

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Among the features I investigated, the individual transaction amount is highly right-skewed. I transformed the amount to a log10 scale, which brought the distribution of the single transaction amount closer to normal. The middle half of single amounts falls between 15 dollars and 100 dollars.

Bivariate Plots Section

How does the single transaction amount differ across regions? I explore the single transaction amount through a boxplot.

Overall, it looks like people from different regions give about the same amount in a single donation, but we can use coord_cartesian to take a closer look.

According to the zoomed-in boxplot above, people from Central California and Silicon Valley gave slightly higher amounts in a single transaction. The region of residence has only a slight impact on the amount of a single donation.

Now move the focus back to the distribution of the single transaction amount across parties. A boxplot by party is plotted below.

This boxplot confirms that the single transaction amount, ranked from highest to lowest by party, is: Libertarian Party > Green Party > Republican Party > Democratic Party, which happens to be the reverse order of the total number of transactions per party (obtained in the univariate section).

In the univariate section, a bar plot showed the number of transactions from different regions. However, it is more illustrative to take the regional population into account when interpreting regional participation in financial contributions to the presidential campaign. The bar graph below shows participation in different regions, with region on the x axis, population on the y axis and bars colored by participation ratio.

Silicon Valley has the highest participation ratio, followed by Jefferson and North California, whereas the participation ratio in Central California is the lowest. The regional participation rate does not have a monotonic relationship with the population of the region.
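As a rough illustration of a participation ratio, the sketch below divides the number of transactions in a region by the region’s population. That definition is an assumption, since the report does not state the exact formula. The figures are the `contbr_region.n` and `region.pop` values for the three regions visible in the data frame printout above; the other regions are omitted because their populations do not appear in this report.

```r
# Transactions and population for the three regions shown in the str() output.
regions <- data.frame(
  contbr_region   = c("SILICON VALLEY", "WEST CA", "SOUTH CA"),
  contbr_region.n = c(164862, 161614, 104188),
  region.pop      = c(6787417, 11504823, 10794239),
  stringsAsFactors = FALSE
)

# Assumed definition: transactions per resident.
regions$participation <- regions$contbr_region.n / regions$region.pop

# Sort from the highest to the lowest participation ratio.
regions <- regions[order(-regions$participation), ]
regions
```

West California has the largest population of the three but not the highest ratio, which illustrates the lack of a monotonic relationship between population and participation.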

Is there a relationship between a city’s population and the individual transaction amounts from people who reside in that city?

## 
##  Pearson's product-moment correlation
## 
## data:  city.pop and contb_receipt_amt.individual
## t = 43.981, df = 359720, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06988166 0.07638242
## sample estimates:
##        cor 
## 0.07313282

The correlation is statistically significant but very weak (r = 0.073), so there is practically no linear relationship between city population and the amount of a single transaction.
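The gap between statistical significance and practical strength can be seen in a small simulation. The variables `x` and `y` below are simulated (an assumption) to have a true correlation of about 0.07; only the value `r_doc` is taken from the test output above.

```r
# Simulate two weakly related variables with a large sample size.
set.seed(1)
n <- 100000
x <- rnorm(n)
y <- 0.07 * x + rnorm(n)

# Pearson correlation test: with n this large, even a tiny correlation
# gives a very small p-value.
res <- cor.test(x, y)
res$estimate
res$p.value

# Share of variance explained by the correlation reported above:
# about 0.5 percent.
r_doc <- 0.07313282
r_doc^2
```

The p-value answers whether the correlation differs from zero. The squared correlation answers how much of the variation it explains, and here that is very little.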

Now I focus on the 7 main candidates mentioned in the univariate section and look at the total amount of contributions each candidate received.

Among the main candidates, Hillary Clinton collected the largest total amount of financial contributions, followed by Bernie Sanders, Ted Cruz and Marco Rubio. Bernie Sanders has many more transactions than Hillary Clinton, yet the total amount he received is far less than hers. Why did this happen? Is it because the mean amount of a single transaction for Bernie Sanders is far smaller than for Hillary Clinton? Also, it is odd that Donald Trump received so little financial support from California although he possibly holds the highest support rate in America. He might have obtained more contributions from other areas, or he could have used his own money. That question is beyond the scope of this article. The main candidates’ single transaction amounts are shown in the boxplot below.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The political geography of California is liberal rather than conservative: people give more money to the Democratic Party than to the Republican Party. The region of a contributor accounts for little difference in the single transaction amount, although people from Silicon Valley and Central California did donate slightly more than people from other regions.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

To take a closer look at the political geography, the support ratio of Democrats to Republicans in each county or region is examined in later analysis. A second concern is the percentage of a county’s or region’s total amount that each party received.

What was the strongest relationship you found?

  • Hillary Clinton obtained about twice the total amount of Bernie Sanders, although she has far fewer transactions than he does.
  • Donald Trump received very little money from California.
  • There is no obvious monotonic relationship between the population of a region and its participation, nor between the population and the single transaction amount from that region.

Multivariate Plots Section

This is the boxplot of the single transaction amount for individual candidates from different parties.

Now have a closer look at the single transaction amount by focusing on the main candidates and contributors’ regions.

It is obvious that the region a contributor comes from has little effect on the single transaction amount.

The median of all contributors’ personal total donation amount is 278 dollars. The median personal total donation from supporters of Jeb Bush, Hillary Clinton and Marco Rubio is larger than 278 dollars, whereas supporters of the other main candidates tend to give less than this overall median. Now I look at whether the contributors’ region of residence affects the individual total contribution amount.

The effect of the region of residence is slight.

Bernie Sanders’s supporters tend to contribute repeatedly, more so than other candidates’ supporters. Most supporters of Jeb Bush, Donald Trump and Marco Rubio contribute only once.

People from Silicon Valley, West California and South California contributed more than people from Central California and Jefferson. Next, look at the behavior of the two main parties’ supporters, focusing on the relationship between how many times a person donates and the total amount that person donates.

Supporters of the Republican Party donate fewer times, and they donate more money in total for the same number of transactions. Next, inspect the relationship between the mean amount and the number of transactions per person in the two main parties.

For supporters of both parties, the more times a person donates, the smaller the amount donated each time. The behavioral differences between supporters of different parties and between regions of residence are very small, almost negligible.

It is known from previous sections that Bernie Sanders has more than twice as many transactions as Hillary Clinton. However, the total amount received by Hillary Clinton is much larger than the total received by Bernie Sanders. Now I take a closer look at the behavior of these two candidates’ supporters.

It is obvious that when supporters of Bernie Sanders and supporters of Hillary Clinton give the same total personal contribution, Bernie Sanders’s supporters have a much higher donation frequency and a much lower mean single transaction amount.
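The per-contributor quantities compared in this section (total amount, number of transactions, mean amount) can be sketched as below. The data frame `toy`, its names and its amounts are invented for illustration. The contributor key combines `contbr_nm` and `contbr_zip`, which is how contributors are identified in this analysis.

```r
# Hypothetical transactions: one frequent small donor, one single large donor.
toy <- data.frame(
  contbr_nm  = c("DOE, JANE", "DOE, JANE", "DOE, JANE", "ROE, JOHN"),
  contbr_zip = c("94118", "94118", "94118", "90025"),
  amt        = c(10, 15, 5, 250),
  stringsAsFactors = FALSE
)

# Identify a contributor by name plus zip code.
key <- paste(toy$contbr_nm, toy$contbr_zip, sep = " | ")

# Aggregate per contributor; tapply() returns results in the same key order.
total <- tapply(toy$amt, key, sum)
times <- tapply(toy$amt, key, length)
per_person <- data.frame(
  contributor = names(total),
  total = as.vector(total),
  times = as.vector(times),
  mean  = as.vector(total) / as.vector(times),
  stringsAsFactors = FALSE
)
per_person
```

In the toy data the frequent donor gives three times with a mean of 10 dollars, while the single donor gives once with a mean of 250 dollars, the same pattern described above for the two candidates’ supporters.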

Draw a county map colored by the proportion of each party’s supporters out of the total number of people who gave a contribution.

I would also like to look at the proportion of contributions by the amount of money people contributed to the two main parties.

Now we get a clear picture of the political geography. California overall leans left-wing and liberal, and fewer than half of the people support the Republican Party. People from Silicon Valley and West California are more liberal, whereas people from South California and Jefferson are more conservative.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Among the main candidates, the largest single transaction amount is 10800, for Ted Cruz. The second largest is 10000, for Bernie Sanders, Jeb Bush and Ben Carson. The smallest maximum single transaction amount is 2700, for Donald Trump. The single transaction amount ranked from the lowest to the highest is as below (median as indicator): Bernie Sanders < Ben Carson = Hillary Clinton = Ted Cruz < Marco Rubio < Donald Trump < Jeb Bush

People who support Bernie Sanders contribute many more times than supporters of any other candidate. The frequency of donation from different candidates’ supporters is ranked from the highest to the lowest as below (median as indicator): Bernie Sanders > Ted Cruz > Ben Carson = Hillary Clinton > Jeb Bush = Marco Rubio = Donald Trump

The cumulative amount an individual contributor gave to each candidate is ranked from highest to lowest as below (median as indicator): Jeb Bush > Hillary Clinton = Marco Rubio > Ben Carson = Ted Cruz = Donald Trump > Bernie Sanders

Finally, among the main candidates, Hillary Clinton has the largest total received amount, followed by Bernie Sanders. Ted Cruz and Marco Rubio are tied for third place in total received financial contributions. Bernie Sanders has the largest number of transactions and Hillary Clinton has the second largest.

People contributed a total of 51595715.4 dollars to Democratic candidates, of which Hillary Clinton obtained 65.94% and Bernie Sanders obtained 32.96%. People contributed 26268578.7 dollars to the Republican Party. People in California gave almost twice as much money (a ratio of 1.964) to the Democratic Party as to the Republican Party. Hillary Clinton’s received amount alone is more than the total received by the 5 main Republican candidates: Donald Trump, Ben Carson, Jeb Bush, Marco Rubio and Ted Cruz.
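The arithmetic behind these figures can be checked directly. The totals and percentages below are the ones quoted in the paragraph above; the variable names are introduced here for illustration only.

```r
# Totals quoted above, in dollars.
dem_total <- 51595715.4
rep_total <- 26268578.7

# Ratio of Democratic to Republican contributions: about 1.964.
ratio <- dem_total / rep_total

# Amounts implied by the quoted shares of the Democratic total.
clinton_amt <- 0.6594 * dem_total
sanders_amt <- 0.3296 * dem_total

# Clinton's implied amount exceeds the entire Republican total.
clinton_amt > rep_total

# Democratic share of the two-party total, in percent.
dem_share <- 100 * dem_total / (dem_total + rep_total)
round(dem_share, 1)
```

Since Clinton’s implied amount is larger than the whole Republican total, it is also larger than the combined total of any subset of Republican candidates, including the 5 main ones.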

In this analysis, California is divided into 6 regions. People from Silicon Valley and West California gave a larger total amount of money, whereas people from Central California and Jefferson gave a smaller total amount. People from Silicon Valley and West California are much keener on the Democratic Party, whereas people from Central California and South California are less keen on it.

Were there any interesting or surprising interactions between features?

At first, I thought that because people who support Bernie Sanders gave little in every single transaction, he would not have collected much money. But it turned out that he collected the second largest amount of money from California. One reason is that so many people support him, and another is that his supporters tend to contribute many times. At least 50% of his supporters contributed 4 or more times. One contributor, named MCLENNAN, MARLYN, donated 218 times in amounts from 1 dollar to 100 dollars, with a mean single donation of 7.32 dollars and a total of 1595.68 dollars. It is also surprising that Bernie Sanders has a larger maximum accumulated amount from one individual than Hillary Clinton: at most 21750 dollars from one person, whereas Hillary Clinton has at most 10800 dollars from one person.


Final Plots and Summary

Plot One

Description One

Across the 6 regions of California, 77809347 dollars were donated to the two main political parties, Republican and Democratic. Of this amount, 66.3% went to the Democratic Party and 33.7% went to the Republican Party. California’s political geography is more left-wing than right-wing. The ratio of Democratic support to Republican support is higher in Silicon Valley and West California and much lower in other areas, especially in South California.

Plot Two

Description Two

Among the main candidates, the 1st quartile of the personal contribution amount is 127, the median is 278 and the 3rd quartile is 750. Comparing each candidate’s largest personal contribution, the highest is 35100, for Ted Cruz, and the lowest is 8100, for Donald Trump. Bernie Sanders’s supporters tend to contribute repeatedly, much more than other candidates’ supporters. The average transaction amount of an individual contributor, by candidate supported, is ranked from the lowest to the highest as below (median as indicator): Bernie Sanders < Ted Cruz < Ben Carson < Hillary Clinton < Donald Trump < Marco Rubio < Jeb Bush.

It is obvious that Bernie Sanders’s supporters tend to contribute the most times but the smallest amount in a single transaction. Ted Cruz’s supporters tend to contribute the second most times, with a median of 3 contributions, and the second smallest amount in a single transaction, with a median of 75 dollars. Supporters of Hillary Clinton and Ben Carson tend to contribute 2 times, and Clinton’s supporters give much more in a single transaction than Carson’s supporters. The median of the average single transaction of a Hillary Clinton supporter is 230 dollars, whereas a Ben Carson supporter gives only 85 dollars per contribution. Supporters of Donald Trump, Marco Rubio and Jeb Bush mostly contribute once, but each time they give a relatively high amount, especially supporters of Jeb Bush. The median of Jeb Bush supporters’ individual mean single transaction amount is 2700 dollars, which is the highest among the 7 main candidates.

Plot Three

Political Geography of California

Description Three

The plot shows political preference using two indicators. The left graph uses the number of supporters as an indicator, whereas the right graph uses the amount of financial contributions each party received. Blues denote pro-Democratic and reds denote pro-Republican. The darker the color, the stronger the preference: dark blue means a much stronger preference for the Democratic Party than light blue, and dark red means a much stronger preference for the Republican Party than light red. Regions without an apparent preference, or that are relatively neutral, are colored gray or near-white light blue or red.

The most pro-Democratic region is Silicon Valley, and the second is North California, from both the people and the money perspectives. Jefferson and West California are still pro-Democratic but with a relatively smaller ratio over the Republican Party. The most pro-Republican region is Central California. South California is roughly in the middle, leaning slightly Democratic.

The most pro-Democratic county is in Silicon Valley, and the second most pro-Democratic county is in Jefferson, which is not the second most pro-Democratic region overall.

The most pro-Republican county is in Central California, which is also the most pro-Republican region. The second most pro-Republican county is in Jefferson, which is not the second most pro-Republican region.

California is overall pro-Democratic, which is obvious from the large area of blue on the political geography map. However, most counties in Central California and one specific county in Jefferson do prefer the Republican Party by a noticeable ratio.

Reflection

The data set I focused on contains 542729 records of financial contributions across 24 variables. Through univariate plots and analysis, I obtained an overview of the data set. The histogram of the single transaction amount showed that it is highly skewed, which is why I applied a log10 transformation to the amount of money. After the transformation, the single transaction amount is close to a normal distribution. I understood that there are outliers in the single transaction amount and also in the number of transactions, which is why in later analysis I mostly inspect data within the 0.99 quantile range. In the bivariate analysis phase, I continued to explore my ideas and interests by inspecting relationships between both qualitative-quantitative pairs and quantitative-quantitative pairs. Through boxplots, I visualized important statistics such as the mean and median for categorical variables. In the multivariate phase, I continued to inspect the relationships observed earlier by adding more variables to see whether there were any lurking variables that would heavily affect the relationship. When I tried to obtain a smoother line for the relationship between the personal contribution amount and the number of personal contributions, it was difficult to obtain an illustrative graph because the geom_point plot was overplotted. By adjusting the x axis and applying a log10 transformation, I succeeded in obtaining a more illustrative graph with estimated smoother lines. Overplotting also affected many other plots. I overcame it by adding transparency, transforming axes and using different categorical colors.
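Restricting the data to the 0.99 quantile range, as described above, can be sketched in a few lines. The vector `amt` is simulated (an assumption), standing in for the single transaction amounts.

```r
# Simulated right-skewed amounts as a stand-in for the real data.
set.seed(7)
amt <- rlnorm(10000, meanlog = log(30), sdlog = 1.5)

# Keep only values at or below the 99th percentile.
cutoff <- quantile(amt, 0.99)
trimmed <- amt[amt <= cutoff]

# Compare the range before and after trimming, and on the log10 scale.
range(amt)
range(trimmed)
range(log10(trimmed))
```

Trimming removes about 1% of the observations but shortens the upper tail considerably, which makes plots easier to read. In ggplot2, overplotting of points is commonly reduced further by setting `alpha` in `geom_point()`.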

There are some limitations in the data analysis. In this analysis, the latest contribution receipt date is April 30th, 2016. The data may be incomplete, since it does not include contributions made after that date. Besides, the outcome may be biased because I ignored all refund transactions in this project. When I analyzed individual contributors, I used contbr_nm and contbr_zip together as the unique contributor identifier. There might be people who share the same name and reside in the same district, so further inspection of whether a name is shared by different people is recommended.

I also came up with some ideas for future work. First, I would like to investigate how contributors’ behavior changes over time. Did the contribution amount and contribution frequency increase or decrease over time, and how did they respond to public events that happened around the time of a donation? Second, I would like to inspect the relationship between people’s contribution behavior and their socio-economic status. Socio-economic status should be defined and categorized by considering the combination of occupation, employer, income, region of residence and other related variables. In this analysis, only the contributors’ region was inspected. More variables should be investigated later to understand people’s behavior. It is also suggested to scrape and collect other features, such as people’s income, in the future.

As the presidential campaign heats up, unprecedented incidents keep occurring. These incidents can heavily affect financial contributions and even shift the political geography dramatically. When interpreting the political geography of California in this article, it is important to understand that it is not constant: it varies, fluctuates and sometimes even reverses as the election goes on. It is recommended to interpret the outcome of this article with reference to more recent and updated information in order to make a more objective judgment and prediction.